Example workflow: multiple linear regression on the abalone data
Data: https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/
Abalone Data (modified: some values are replaced by N/A):
Sex / nominal / -- / M, F, and I (infant)
Length / continuous / mm / Longest shell measurement
Diameter / continuous / mm / perpendicular to length
Height / continuous / mm / with meat in shell
Whole weight / continuous / grams / whole abalone
Shucked weight / continuous / grams / weight of meat
Viscera weight / continuous / grams / gut weight (after bleeding)
Shell weight / continuous / grams / after being dried
Rings / integer / -- / +1.5 gives the age in years
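The last field carries the only derived quantity in the data dictionary: adding 1.5 to the ring count gives the age in years. A one-line illustration:

```python
# "+1.5 gives the age in years": convert a ring count to an age
rings = 9
age = rings + 1.5
print(age)
```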
import sys
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
from sklearn import preprocessing
import statsmodels.api as sm
import statsmodels.formula.api as smf
my_libs_dir = '../'
if my_libs_dir not in sys.path:
    sys.path.append(my_libs_dir)  # add the path to the my_libs directory
# The following lines are needed to auto-reload my library file
# Without these lines, my library file is read only once and
# modifications of my library file are not reflected.
%load_ext autoreload
%autoreload 1
%aimport my_libs.linear_reg
# import from my library file
from my_libs.linear_reg import step_aic_forward, calc_vifs
%config InlineBackend.figure_formats = {'png', 'retina'} #high-reso images
font = {'family' : 'Yu Mincho'}
matplotlib.rc('font', **font)
# To show all rows and columns in the results
pd.options.display.max_rows = 999
pd.options.display.max_columns = 999
Include, as exhaustively as possible, every factor that could plausibly affect the objective variable among the explanatory variables.
Check the CSV file before loading it, and rename the column labels if necessary.
If the CSV file is encoded in Shift-JIS, encoding='shift-jis' is needed in read_csv.
csv_in = 'abalone_modified.csv'
df_all = pd.read_csv(csv_in, delimiter=',', skiprows=0, header=None)
# no header in csv, so set columns explicitly
df_all.columns = ['sex', 'len', 'd', 'h', 'w_all', 'w_meat', 'w_gut', 'w_shell', 'ring']
print(df_all.shape)
print(df_all.info())
display(df_all.head())
Inspect the numeric and categorical columns
display(df_all.describe())
display(df_all.describe(exclude='number'))
Display the rows that contain missing values
display(df_all[df_all.isnull().any(axis=1)])
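The row filter above works because `isnull().any(axis=1)` builds a boolean mask that is True for every row with at least one missing value. A toy frame (hypothetical data) makes this visible:

```python
import numpy as np
import pandas as pd

# isnull() marks each cell; any(axis=1) collapses that to one flag per row
df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, np.nan]})
rows_with_na = df[df.isnull().any(axis=1)]
print(rows_with_na)  # rows 1 and 2 each contain a NaN
```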
Drop the rows that contain missing values (or fill them in)
df_all = df_all.dropna().reset_index(drop=True)
#df_all = df_all.fillna(df_all.mean()) # if you want to fill missing data instead of deleting them
print(df_all.shape)
display(df_all.head())
Split the data into explanatory variables and the objective variable
X_all_org = df_all.loc[:, 'sex':'w_shell'] # explanatory variables
y = df_all['ring'] # objective variable
print('X_all_org:', X_all_org.shape)
display(X_all_org.head())
print('y:', y.shape)
print(y.head())
Convert the categorical variable into dummy variables
X_all = pd.get_dummies(X_all_org, drop_first=True)
print('X_all:', X_all.shape)
display(X_all.head())
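With `drop_first=True`, one level of the categorical variable (the alphabetically first, here `F`) is dropped, because a row where all remaining dummies are 0 already implies that level. A minimal illustration on hypothetical data:

```python
import pandas as pd

# Three levels M, F, I -> two dummy columns; sex_I == 0 and sex_M == 0 means 'F'
toy = pd.DataFrame({'sex': ['M', 'F', 'I', 'M']})
dummies = pd.get_dummies(toy, drop_first=True)
print(dummies)
```

Dropping one level avoids perfect collinearity between the dummies and the constant term in the regression.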
Draw pairwise scatter plots of all variables, and also compute the correlation coefficients
Pairwise Pearson correlation coefficients
corr_all = X_all.corr(method='pearson')
display(corr_all)
Print the pairs of explanatory variables whose correlation coefficients have large absolute values
Method 1. Straightforward method
th_corr = 0.3
n_X = corr_all.shape[0]
corr_large = []
for i in range(n_X):
    for j in range(i+1, n_X):
        cc1 = corr_all.iat[i, j]
        if abs(cc1) > th_corr:
            corr_large.append([corr_all.columns[i], corr_all.columns[j], cc1])
corr_large.sort(reverse=True, key=lambda x: abs(x[2]))
display(corr_large)
Method 2. Rather tricky method ...
th_corr = 0.3
keep = np.triu(np.ones(corr_all.shape), k=1).astype('bool').flatten()
#print(np.ones(corr_all.shape)) # debug
#print(np.triu(np.ones(corr_all.shape), k=1)) # debug
#print(np.triu(np.ones(corr_all.shape), k=1).astype('bool')) # debug
#print(keep) # debug
triu = corr_all.stack()[keep]
#print(corr_all.stack()) # debug
triu_sorted = triu[ np.abs(triu).sort_values(ascending=False).index ]
#print(np.abs(triu).sort_values(ascending=False)) # debug
#print(np.abs(triu).sort_values(ascending=False).index) # debug
print(triu_sorted[ (triu_sorted < -th_corr) | (triu_sorted > th_corr) ])
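To see what the mask trick in Method 2 is doing, here is a toy 3×3 example: `np.triu(..., k=1)` keeps only the strictly upper-triangular entries, so each variable pair survives exactly once after `stack()`.

```python
import numpy as np
import pandas as pd

# Toy "correlation matrix" with distinct values so the mask is easy to follow
m = pd.DataFrame(np.arange(9).reshape(3, 3),
                 index=list('abc'), columns=list('abc'))
keep = np.triu(np.ones(m.shape), k=1).astype('bool').flatten()
pairs = m.stack()[keep]  # only (a,b), (a,c), (b,c) remain
print(pairs)
```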
Pairwise scatter plots of all variables
scatter_matrix(X_all, figsize=(30,30))
plt.show()
If you prefer seaborn:
sns.pairplot(X_all)
plt.show()
A heatmap is also an option
plt.figure(figsize=(10,10))
sns.heatmap(corr_all,annot=True,fmt='.2f',cmap='bwr')
plt.show()
Multiple linear regression using all explanatory variables, without standardization
X_all_c = sm.add_constant(X_all)
model = sm.OLS(y, X_all_c)
results = model.fit()
print(results.summary())
#help(results)
#print(dir(results)) # To know all methods/attributes of an object
Check the coefficient of determination (R²) and the adjusted R² to judge whether fitting a linear model is reasonable in the first place
print('R2:', results.rsquared)
print('Adj R2:', results.rsquared_adj)
Rather good ...
Test of the regression equation (does the fitted equation actually explain the objective variable?)
print('p-value (F-statistic):', results.f_pvalue)
Very small p-value, so this MLR equation is considered to be significant.
Compare the coefficients of the explanatory variables
Multiple linear regression with all explanatory variables and the objective variable standardized
Compare the standardized partial regression coefficients
# NOTE: after scaling, X_scaled and y_scaled are ndarrays, not DataFrames.
X_scaled = preprocessing.scale(X_all)
y_scaled = preprocessing.scale(y)
model = sm.OLS(y_scaled, X_scaled)
results = model.fit()
print(results.summary())
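The standardized coefficients are comparable precisely because `preprocessing.scale` brings every column to mean 0 and (population) standard deviation 1. A quick check on hypothetical data:

```python
import numpy as np
from sklearn import preprocessing

# Each column is centered and divided by its (population) standard deviation
x = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
x_scaled = preprocessing.scale(x)
print(x_scaled.mean(axis=0))  # ~ [0, 0]
print(x_scaled.std(axis=0))   # ~ [1, 1]
```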
Variable selection
# NOTE: make DataFrames corresponding to X_scaled and y_scaled.
dfX_scaled = pd.DataFrame(X_scaled, columns=X_all.columns)
dfy_scaled = pd.Series(y_scaled, name=y.name)
exog = list(dfX_scaled.columns) # Initial set = all explanatory variables
endog = [dfy_scaled.name] # Objective variable
df_scaled = pd.concat([dfX_scaled, dfy_scaled], axis=1)
Variable selection by forward selection (based on AIC)
results_aic=step_aic_forward(smf.ols, exog, endog, data=df_scaled)
print(results_aic.aic)
print(results_aic.model.exog_names)
print(results_aic.model.endog_names)
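`step_aic_forward` comes from the author's own helper library (`my_libs/linear_reg.py`), whose code is not shown here. As a rough idea of what forward selection by AIC can look like, here is an illustrative sketch (not the author's implementation), assuming an `smf.ols`-style model function and the `exog`/`endog` calling convention used above:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def step_aic_forward_sketch(model_fn, exog, endog, data):
    """Greedy forward selection minimizing AIC (illustrative sketch only)."""
    selected, remaining = [], list(exog)
    response = endog[0]
    best_aic = model_fn(f'{response} ~ 1', data=data).fit().aic  # intercept-only
    while remaining:
        # Fit one candidate model per remaining variable; keep the lowest AIC
        trials = [(model_fn(f'{response} ~ ' + ' + '.join(selected + [v]),
                            data=data).fit().aic, v) for v in remaining]
        aic, var = min(trials)
        if aic >= best_aic:
            break  # no candidate improves AIC; stop
        best_aic = aic
        selected.append(var)
        remaining.remove(var)
    formula = f'{response} ~ ' + (' + '.join(selected) if selected else '1')
    return model_fn(formula, data=data).fit()

# Synthetic demo: y depends on x1 and x2 but not on x3
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=['x1', 'x2', 'x3'])
demo['y'] = 2 * demo['x1'] - 3 * demo['x2'] + rng.normal(scale=0.5, size=200)
res = step_aic_forward_sketch(smf.ols, ['x1', 'x2', 'x3'], ['y'], demo)
print(res.model.exog_names)
```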
Check for multicollinearity
Format of statsmodels results: https://www.statsmodels.org/dev/generated/statsmodels.regression.linear_model.RegressionResults.html
endogs = results_aic.model.endog_names
exogs = results_aic.model.exog_names.copy()
exogs.remove('Intercept')
print(exogs) # debug
X_c = sm.add_constant(X_all[exogs])
vifs = calc_vifs(X_c, y)
display(vifs)
How to eliminate multicollinearity depends on the case.
Here we simply delete the three explanatory variables with high VIFs.
exogs.remove('w_all')
exogs.remove('w_meat')
exogs.remove('w_shell')
print(exogs) # debug
X_c = sm.add_constant(X_all[exogs])
vifs = calc_vifs(X_c, y)
display(vifs)
For all explanatory variables, VIF < 10, so we can go forward.
The final standardized partial regression coefficients show how strongly each explanatory variable influences the objective variable
X_final_scaled = dfX_scaled[exogs]
model_final_scaled = sm.OLS(y_scaled, X_final_scaled)
results_final_scaled = model_final_scaled.fit()
print(results_final_scaled.summary())
print(results_final_scaled.params)
The order of the strength of influence of each selected explanatory variable (d, h, w_gut, sex_I) on the objective variable ring:
d (positive) > h (positive) > w_gut (negative) > sex_I (negative)
Test of the regression equation (does the fitted equation actually explain the objective variable?)
print('p-value (F-statistic):', results_final_scaled.f_pvalue)
Very small p-value, so this MLR equation is considered to be significant.
Multiple linear regression using the selected explanatory variables, without standardization
X_final_c = sm.add_constant(X_all[exogs])
model_final = sm.OLS(y, X_final_c)
results_final = model_final.fit()
print(results_final.summary())
Check R² and the adjusted R² to see how well the linear model fits
print('R2:', results_final.rsquared)
print('Adj R2:', results_final.rsquared_adj)
The fit of "the best model" is not good ...
The final (unstandardized) partial regression coefficients tell us how much the objective variable increases when each explanatory variable increases by 1.
print(results_final.params)
Coefficients of the MLR:
each coefficient is the increment of the objective variable when the corresponding explanatory variable is increased by 1 while the other variables are held fixed.
ring $\sim$ 2.770 + 20.075 h + 13.822 d + (-5.129) w_gut + (-1.088) sex_I
Use the obtained best model to make predictions.
NOTE: prepend a 1 to each row of the X data for the constant term
(you can also use sm.add_constant() to add it).
dfX_test = pd.DataFrame([[1, 0.1, 0, 0.1, 0.1],
                         [1, 0.2, 1, 0.2, 0.2]],
                        columns=results_final.params.index)  # example
print('X for prediction:')
display(dfX_test)
y_test = results_final.predict(dfX_test)
print('Predicted y:')
print(y_test)